The random XORSAT problem deals with large random linear systems of Boolean
variables. The difficulty of such problems is controlled by the ratio of the number of
equations to the number of variables. It is known that in some range of values of this
parameter, the space of solutions breaks into many disconnected clusters. Here
we study precisely the corresponding geometrical organization. In particular, the
distribution of distances between these clusters is computed by the cavity method.
This allows us to study the 'x-satisfiability' threshold, the critical density of equations
where there exist two solutions at a given distance.
I. INTRODUCTION

Constraint Satisfaction Networks (CSN) are problems involving many discrete variables,
with values in a finite alphabet, related by low-density constraints: each constraint involves
a finite number of variables. Problems of this kind arise in many branches of science, from
statistical physics (spin or structural glasses [1]) to information theory (low-density
parity-check (LDPC) codes [2, 3]) and combinatorial optimization (satisfiability, colouring [4]). The
'thermodynamic limit' of such problems is obtained when the number of variables and the
number of constraints go to infinity, keeping their ratio, the density of constraints $\alpha$, fixed. A
lot of attention has been focused in recent years on the study of random CSN, both because of
their practical interest in coding and as a means to study "typical case" complexity (as
opposed to the traditional worst-case complexity analysis). Many CSN are known to undergo
a SAT-UNSAT phase transition when the density of constraints increases: there is a sharp
threshold separating a SAT phase, where all constraints can be satisfied with probability one
in the thermodynamic limit, from an UNSAT phase where, with probability one, there is no
configuration of the variables satisfying all the constraints. While the existence of a sharp
threshold has been proved by Friedgut [5] for satisfiability and colouring, there is not yet any
rigorous proof of the widely accepted conjecture according to which the threshold density
of constraints converges to a fixed value $\alpha_c$ in the thermodynamic limit.

Recent years have seen the upsurge of statistical physics methods in the study of CSN.
In particular, the replica method and the cavity method have been used to study the phase
diagram [6-8]. Their most spectacular results are some arguably exact (but not yet rigorously
proved) expressions for $\alpha_c$, and the existence of an intermediate SAT phase, in a region
of constraint density $]\alpha_d, \alpha_c[$, where the space of solutions is split into many clusters, far
away from each other. This clustering is an important building block of the theory: it
is at the origin of the necessity to use the cavity method at the so-called one-step replica
symmetry breaking (1RSB) level; this method can be seen as a message-passing procedure
and used as an algorithm for finding a SAT assignment of the variables. This algorithm,
called survey propagation, turns out to be very powerful in satisfiability and colouring,
and its effectiveness can be seen as one indirect piece of evidence in favour of clustering.
On intuitive grounds, clustering is often held responsible for blocking many local search
algorithms [9]. Although there is no general discussion of this statement, the
phenomenon was thoroughly investigated in the case of XORSAT [23].

The clustering effect can be studied in a more formal way by introducing the notion
of $x$-satisfiability [10, 11]. A CSN with $N$ variables is said to be $x$-satisfiable ($x$-SAT) if there
exists a pair of SAT assignments of the variables which differ in a number of variables
$\in [Nx - \epsilon(N), Nx + \epsilon(N)]$. Here $x$ is the reduced distance, which we keep fixed as $N$ goes to
infinity. The resolution $\epsilon(N)$ has to be sub-linear in $N$: $\lim_{N \to \infty} \epsilon(N)/N = 0$, but its precise
form is unimportant for our large-$N$ analysis. For example we can choose $\epsilon(N) = \sqrt{N}$.
For many random CSN, it is reasonable to conjecture, in parallel with the existence of a
satisfiability threshold, that $x$-satisfiability has a sharp threshold $\alpha_c(x)$ such that:

• if $\alpha < \alpha_c(x)$, a random formula is $x$-SAT almost surely.

• if $\alpha > \alpha_c(x)$, a random formula is $x$-UNSAT almost surely.

This conjecture has been proposed for $k$-satisfiability of random Boolean formulas where each
clause involves exactly $k$ variables with $k \ge 3$. So far only a weaker conjecture, analogous
to Friedgut's theorem [5], has been established [11]. It states the existence of a non-uniform
threshold $\alpha_c^{(N)}(x)$. Rigorous bounds on $\alpha_c(x)$ have been found in [11] for the $k$-satisfiability
problem with $k \ge 8$, using moment methods developed in [12], but so far this $x$-satisfiability
threshold has not been computed.

In this paper we compute the $x$-satisfiability threshold $\alpha_c(x)$ in the random XORSAT
problem using the cavity method. This is a problem of random linear equations with Boolean
algebra. It is important because many efficient error-correcting codes are based on low-density
parity checks, the decoding of which involves precisely such linear systems. It is
also one of the best understood cases of CSN. In particular, efforts to extend the replica
method [13] and the cavity method [14] to deal with models defined on finite-connectivity
lattices have resulted in the first exact (but non-rigorous) derivation of its phase diagram
[15]. Later, a clear characterization of these clusters, combined with simple combinatorial
arguments, gave a rigorous basis to these predictions [16-18]. These works have computed
the phase diagram in detail and provide expressions for the two thresholds $\alpha_d < \alpha_c < 1$.

Our computation of $\alpha_c(x)$ confirms this known structure, and it also provides insight into
the geometrical structure of clusters. We find that $\alpha_c(x)$ is non-monotonic (see fig. 5), which
confirms the existence of gaps in distances where there does not exist any pair of solutions.

The method used in our computation is in itself interesting. It turns out that it is not
possible to compute $\alpha_c(x)$ directly, by fixing $x$ and varying $\alpha$. Instead, we work at a fixed
value of $\alpha$ and introduce a probability distribution for pairs of SAT assignments, where the
distance between the solutions plays the role of the energy. The computation of the entropy
as a function of the energy, and more precisely the computation of the energies where
it vanishes, then allows one to reconstruct $\alpha_c(x)$. Our computation thus involves a mixture of
hard constraints (the fact that the two assignments must satisfy the XORSAT formula), and
soft constraints (the Boltzmann weight which depends on their distance). This is reflected
in the structure of the cavity fields that solve this problem.

The remainder of this paper is organized as follows. The next section introduces some
notations. In section III, we analyse classical Survey Propagation on XORSAT and show
its equivalence with the "leaf removal" [18] or "decimation" [16] algorithm. This analysis
allows us to re-derive the phase diagram of XORSAT, and sets up useful notations and concepts
for later computations. In section IV we perform a statistical mechanics analysis of weight
properties in a single cluster using the cavity method. Section V applies this formalism to
the computation of the cluster diameter, while section VI is devoted to the evaluation of
inter-cluster distances. In section VII we sum up and discuss our results.

II. NOTATIONS AND DEFINITIONS 

A XORSAT formula is defined on a string of variables $x_1, x_2, \ldots, x_N \in \{0, 1\}$ by a set
of $M$ parity checks of the form:

$$\sum_{i \in V(a)} x_i = y_a \pmod 2, \quad \text{for all } a = 1, \ldots, M \qquad (1)$$

where $y_a \in \{0, 1\}$. Here $V(a) \subset \{1, \ldots, N\}$ is the subset of variables involved in parity check
$a$. Later on $i \in a$ shall be used as a shorthand for $i \in V(a)$.
Eq. (1) can be rewritten in matrix form:

$$A\mathbf{x} = \mathbf{y} \pmod 2, \qquad A = \{A_{ia}\}_{i \in [N],\, a \in [M]} \qquad (2)$$

where $A_{ia} = 1$ if $i \in a$ and $A_{ia} = 0$ otherwise. The pair $F = (A, \mathbf{y})$ defines the formula. Such
a linear system can be solved in polynomial time by Gaussian elimination. If a formula has
solutions, it is SAT; otherwise, it is UNSAT. The thermodynamic limit is $N \to \infty$, $M \to \infty$
with a fixed density of constraints $\alpha = M/N$.
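
As an illustration of the statement that such a system is solved in polynomial time, here is a minimal sketch of a random $k$-XORSAT generator and a mod-2 Gaussian elimination. The function names `random_xorsat` and `solve_gf2` and the bitmask encoding of the equations are choices made for this example only; they are not taken from the paper.

```python
import random

def random_xorsat(n, m, k, rng):
    """Draw m parity checks on k distinct variables each, with uniform random right-hand sides."""
    checks = [sorted(rng.sample(range(n), k)) for _ in range(m)]
    y = [rng.randrange(2) for _ in range(m)]
    return checks, y

def solve_gf2(n, checks, y):
    """Gaussian elimination mod 2. Returns (rank of A, one solution or None if UNSAT)."""
    # Encode each equation as an integer: bits 0..n-1 are coefficients, bit n is the right-hand side.
    rows = [sum(1 << i for i in c) | (b << n) for c, b in zip(checks, y)]
    mask = (1 << n) - 1
    pivots = []          # (pivot column, reduced row), in order of insertion
    sat = True
    for r in rows:
        for col, p in pivots:
            if (r >> col) & 1:
                r ^= p   # cancel the pivot column of an earlier row
        coeffs = r & mask
        if coeffs == 0:
            if r >> n:   # the equation reduced to 0 = 1
                sat = False
            continue     # otherwise the equation was redundant
        col = (coeffs & -coeffs).bit_length() - 1  # lowest set bit becomes the pivot
        pivots.append((col, r))
    if not sat:
        return len(pivots), None
    x = [0] * n          # free variables are set to 0
    for col, r in reversed(pivots):
        val = (r >> n) & 1
        for j in range(n):
            if j != col and (r >> j) & 1:
                val ^= x[j]
        x[col] = val
    return len(pivots), x

rng = random.Random(0)
n, k = 30, 3
checks, y = random_xorsat(n, 15, k, rng)
rank, x = solve_gf2(n, checks, y)
if x is not None:
    assert all(sum(x[i] for i in c) % 2 == b for c, b in zip(checks, y))
    print("rank:", rank, "entropy density:", (n - rank) / n)
```

For a SAT formula the number of solutions is $2^{N - \mathrm{rank}(A)}$, so the printed quantity is the base-2 entropy density of that instance.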

In this paper we specialize to random $k$-XORSAT formulae, where each equation involves a
subset of $k$ variables, chosen independently with uniform probability among the $\binom{N}{k}$ possible
ones, and each $y_a$ independently takes value 0 or 1 with probability 1/2. One important
characterization of a XORSAT formula $F = (A, \mathbf{y})$ is the number $\mathcal{N}_N(F)$ of assignments
of the Boolean variables $\mathbf{x}$ which satisfy all the equations, and the corresponding entropy
density

$$s_N(F) = \frac{1}{N} \log \mathcal{N}_N(F) \qquad (3)$$

Logarithms are base 2 throughout the paper. Using a spin representation $\sigma_i = (-1)^{x_i}$, the
$k$-XORSAT problem can also be mapped onto a spin glass model where interactions involve
products of $k$ spins (the variables $(-1)^{y_a}$ then play the role of quenched random exchange
couplings) [15], and the question of whether a formula is SAT is equivalent to asking whether
the corresponding spin-glass instance is frustrated.
Previous work [15-18] has shown that:

• for $\alpha < \alpha_d(k)$, the formula is SAT almost surely (i.e. with probability $\to 1$ as
$N \to \infty$). The solution set forms one big connected component, and the entropy density
concentrates at large $N$ to $(N - M)/N = 1 - \alpha$. This phase is called the EASY-SAT
phase.

• for $\alpha_d(k) < \alpha < \alpha_c(k)$, the formula is still SAT almost surely, but the solution set
is made of an exponentially large (in $N$) number of components far away from each
other (in the following we shall give a precise definition of these clusters). The entropy
density also concentrates at large $N$ to $(N - M)/N = 1 - \alpha$. This is the HARD-SAT
phase.

• for $\alpha > \alpha_c(k)$ (with $\alpha_c(k) < 1$), the formula is UNSAT almost surely. The entropy is
$-\infty$. This second transition is the usual SAT-UNSAT transition.

The fact that, throughout the SAT phase ($\alpha < \alpha_c(k)$), the entropy density concentrates
to $1 - \alpha$ is not surprising: it can be understood as the fact that matrix $A$ has rank $M$ almost
surely in the SAT phase. The intuitive reason is that, each time there exists a linearly
dependent set of checks, the choice of $y_a$ has probability 1/2 to lead to a contradiction. So
the rank of $A$ cannot differ much from $M$ in the SAT phase. From the point of view of
linear algebra, the existence of the clustered phase, i.e. the fact that the vector subspace of
SAT assignments breaks into disconnected pieces, is more surprising, as is the discontinuity
of $s_N(F)$ at the transition $\alpha_c$. These two aspects are in fact related: the quantity which
vanishes at the SAT-UNSAT transition is actually the log of the number of clusters of
solutions, while each cluster keeps a finite volume.

We will study the geometric properties of the space of solutions for random $k$-XORSAT
in the HARD-SAT phase using the notion of $x$-satisfiability. In terms of solutions of linear
equations, we want to know if there exist two Boolean vectors $\mathbf{x}$ and $\mathbf{x}'$ which both satisfy
$A\mathbf{x} = A\mathbf{x}' = \mathbf{y}$, where the Hamming distance $d_{\mathbf{x},\mathbf{x}'} \equiv |\mathbf{x} - \mathbf{x}'| = Nx$. Clearly, if such a pair
exists, $\mathbf{x} - \mathbf{x}'$ is a solution to the homogeneous ('ferromagnetic') problem where $\mathbf{y} = \mathbf{0}$:

$$A(\mathbf{x} - \mathbf{x}') = \mathbf{0} \qquad (4)$$

Therefore, a formula $F = (A, \mathbf{y})$ is $x$-SAT if and only if $F$ is SAT and if there exists a solution
$\mathbf{x}$ to the homogeneous system $A\mathbf{x} = \mathbf{0}$ of weight $d_{\mathbf{x},\mathbf{0}} \simeq Nx$ (the weight is by definition the
distance to $\mathbf{0}$). Note that for $x = 0$, this second condition is automatically fulfilled, and
$x$-satisfiability is equivalent to satisfiability. This linear space structure also implies that
the set of solutions looks the same seen from any solution in the SAT phase: the number of
solutions at distance $d$ from any given solution $\mathbf{x}_0$ is independent of $\mathbf{x}_0$.
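
The equivalence between pair distances and weights of homogeneous solutions can be checked by brute force on a toy instance. The sketch below uses exhaustive enumeration, so it is only meant for very small $N$; the formula and the helper names `solutions`, `pair_distances` and `kernel_weights` are hypothetical and chosen for illustration.

```python
from itertools import product

def solutions(n, checks, y):
    """All x in {0,1}^n satisfying every parity check (exhaustive search, small n only)."""
    return [x for x in product((0, 1), repeat=n)
            if all(sum(x[i] for i in c) % 2 == b for c, b in zip(checks, y))]

def pair_distances(sols):
    """Set of Hamming distances realized by pairs of solutions."""
    return {sum(s != t for s, t in zip(a, b)) for a in sols for b in sols}

def kernel_weights(n, checks):
    """Set of weights of solutions to the homogeneous system (y = 0)."""
    return {sum(x) for x in solutions(n, checks, [0] * len(checks))}

# Hypothetical toy formula with n = 5 variables and 3 checks.
n = 5
checks = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
y = [1, 0, 1]
sols = solutions(n, checks, y)
assert pair_distances(sols) == kernel_weights(n, checks)
```

In this language, a SAT formula is $x$-SAT exactly when the set returned by `kernel_weights` contains a weight within the chosen resolution of $Nx$.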

Distance properties can also be investigated directly by evaluating extremal distances
between solutions. To that end we define three distances: (a) the cluster diameter $d_1$, i.e.
the largest Hamming distance between solutions belonging to the same cluster; this diameter
is independent of the cluster; (b) the minimal and maximal inter-cluster distances $d_2$ and $d_3$,
i.e. the smallest (resp. largest) Hamming distance between solutions belonging to distinct
clusters. All three distances are assumed to be self-averaging in the thermodynamic limit
of the random problem: $x_1(\alpha) = d_1/N$, $x_2(\alpha) = d_2/N$ and $x_3(\alpha) = d_3/N$ shall denote the
corresponding limits. In the particular case where $k$ is even, the formula is invariant under
the transformation $\mathbf{x} \leftrightarrow \mathbf{x} + \mathbf{1} \pmod 2$, which is reflected in terms of distances by a symmetry
with respect to $x = 1/2$: $x \leftrightarrow 1 - x$. A direct consequence is that $x_3(\alpha) = 1 - x_2(\alpha)$, and
that a fourth weight, defined as $1 - x_1(\alpha)$, will also come into play. These distance functions
are related to the $x$-satisfiability threshold as follows: at fixed $\alpha$, a formula is $x$-SAT almost
surely iff

• $x \in [0, x_1(\alpha)] \cup [x_2(\alpha), x_3(\alpha)]$ when $k$ is odd.

• $x \in [0, x_1(\alpha)] \cup [x_2(\alpha), 1 - x_2(\alpha)] \cup [1 - x_1(\alpha), 1]$ when $k$ is even.

We will now compute $x_1, x_2, x_3$ with the cavity method.

III. LEAF REMOVAL AS AN INSTANCE OF SURVEY PROPAGATION

XORSAT formulae are conveniently represented by factor graphs, called Tanner graphs,
in which variables and checks form two distinct types of nodes, with the simple rule that
the edge $(i, a)$ between $i$ and $a$ is present if $i \in a$.

An example of a Tanner graph and its associated linear system is shown below:



(a) Xi + X2 + Xs = (mod 2) 

(b) X2 + x-s = 1 (mod 2) 

(c) X2 + X3 + X4 = 1 (mod 2) 

The number of variables involved in a check $a$, denoted by $|V(a)|$, is the degree of $a$ in
the factor graph. Here we study $k$-XORSAT, where this degree is fixed to $k$. Similarly, if
$V(i)$ denotes the set of parity checks in which $i$ is represented, $|V(i)|$ is the degree of $i$ in the
factor graph. The degrees of checks are commonly referred to as right-degrees, and those of
variables as left-degrees. The infinite-length (thermodynamic) limit is obtained by sending
$N$ and $M$ to infinity while keeping the ratio $\alpha = M/N$ fixed. In this limit, the distribution
of left-degrees is a Poisson law of parameter $k\alpha$: the probability of a variable having degree
$\ell$ is $\pi_{k\alpha}(\ell)$, where $\pi_x(\ell) = \exp(-x)\, x^{\ell}/\ell!$.

Here we use the leaf removal algorithm (LR) in order to obtain a precise definition of the 
notion of "cluster" or "component" of solutions, one which is valid also for finite N. The 
algorithm proceeds as follows: pick a variable of degree one (called a leaf), remove it as well 
as the only check it is connected to. Continue the process until there remains no leaf. The 
interest of this algorithm is easily seen: a variable on a leaf can always be assigned in such 
a way that the (unique) check to which it is connected is satisfied. 

The linear system remaining after leaf removal is independent of the order in which leaves 
are removed. It is called the core. A 'core check' is a check which only involves core variables. 
If the core is empty, the problem is trivially SAT. In general, given a solution of the core, 
one can easily reconstruct a solution of the complete formula by running leaf removal in the 
reverse direction, in a scheme which we refer to as leaf reconstruction. In this procedure, 
checks are added one by one along with their leaves, starting from the core. If an added 
check involves only one leaf, the value of that variable is determined uniquely so that the 
check is satisfied. If the number of leaves $k'$ is greater than 1, one can choose the joint value
of those leaves among $2^{k'-1}$ possibilities. The process is iterated until the complete factor
graph has been rebuilt. Given a core solution, one can construct many solutions to the 
complete formula. Variables which are uniquely determined by the core solution are called 
frozen, and variables that can fluctuate are called floppy. Of course, by definition, the frozen 
part includes the core itself. A core solution defines a cluster. All solutions built from the 
same core solution belong to the same cluster. We shall see later how this definition fits in 
the intuitive picture that we sketched previously in terms of connectedness. 
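
The leaf removal procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a formula is given as a list of checks, each a tuple of distinct variable indices; the function name `leaf_removal` and the stack-based bookkeeping are choices made here, not the paper's implementation.

```python
def leaf_removal(n, checks):
    """Iteratively remove degree-one variables together with their unique check.

    Returns (core_vars, core_checks): the variables and check indices that survive.
    """
    var_checks = [set() for _ in range(n)]
    for a, c in enumerate(checks):
        for i in c:
            var_checks[i].add(a)
    alive = [True] * len(checks)
    stack = [i for i in range(n) if len(var_checks[i]) == 1]
    while stack:
        i = stack.pop()
        if len(var_checks[i]) != 1:
            continue  # the degree of i changed since it was pushed
        a = var_checks[i].pop()   # the only check attached to leaf i
        alive[a] = False
        for j in checks[a]:
            if j != i:
                var_checks[j].discard(a)
                if len(var_checks[j]) == 1:
                    stack.append(j)   # j has become a leaf
    core_checks = [a for a in range(len(checks)) if alive[a]]
    core_vars = sorted({i for a in core_checks for i in checks[a]})
    return core_vars, core_checks

# Hypothetical example: four checks on variables 0..3 form a core,
# while checks (3, 4, 5) and (0, 1, 6) hang off it through leaves.
checks = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (3, 4, 5), (0, 1, 6)]
print(leaf_removal(7, checks))
```

Running the removal in reverse order then gives the leaf reconstruction: each removed check is re-added and its leaves are assigned so that the check is satisfied.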

We propose here an alternative to the leaf removal algorithm, which also builds the core
but actually keeps more information. The approach is inspired by the cavity method, and
is a special instance of Survey Propagation (SP) [7]. To each edge $(i, a)$, one assigns two
numbers $m^t_{i \to a}$ and $\hat{m}^t_{a \to i}$ belonging to $\{0, 1\}$, updated as follows:

• At $t = 0$, $m^0_{i \to a} = 1$, $\hat{m}^0_{a \to i} = 1$ for all edges $(i, a)$.

• $\hat{m}^{t+1}_{a \to i} = \prod_{j \in a - i} m^t_{j \to a}$ for all $(i, a)$,

• $m^{t+1}_{i \to a} = 1 - \prod_{b \in i - a} (1 - \hat{m}^t_{b \to i})$ for all $(i, a)$.

• Stop when $m^{t+1}_{i \to a} = m^t_{i \to a}$ for all $(i, a)$.

Here $a \in i$ is a shorthand for $a \in V(i)$.

The interpretation of $m^t_{i \to a} = 1$ is: "variable $i$ is constrained at time $t$ in the absence
of check $a$", and that of $\hat{m}^t_{a \to i} = 1$ is: "check $a$ constrains variable $i$ at time $t$". One also defines
$M_i^t = 1 - \prod_{a \in i} (1 - \hat{m}^t_{a \to i}) \in \{0, 1\}$. This number indicates whether node $i$ is constrained at
time $t$ ($M_i^t = 1$) or not ($M_i^t = 0$).

At $t = 0$, all variables are constrained. The algorithm consists in detecting the under-constrained
variables and propagating the information through the graph to simplify the
formula. At the first step, only variables of degree one are affected: if $i$ is of degree one and
is connected to $a$, $m^1_{i \to a} = 1 - \prod_{\emptyset} = 0$. This, in turn, gives freedom to $a$, which no longer
constrains its other variables: $\hat{m}_{a \to j} = 0$ at the following update, for $j \in a - i$. This effectively
removes $a$ and $i$ from the formula, just as in the leaf removal algorithm. In the subsequent steps
of the iteration, a variable $i$ is considered as a leaf (in the LR sense) if there exists exactly one
$a \in i$ such that $\hat{m}_{a \to i} = 1$. In that case we have $m_{i \to a} = 0$, thus implementing a step of LR.

Let us add a word about the term "Survey Propagation" we have used so far. Analysis of
the 1RSB cavity equations at zero temperature [18] (see [7] for a more complete discussion
in the case of $k$-SAT) shows that cavity biases fall into two categories, depending on the
edge we consider: either a warning is sent (compelling the variable to take value 0 or 1 depending
on the cluster, with probability one half for each), or no warning is sent. (In more technical terms,
survey propagation reduces to warning propagation.) The first situation corresponds in
our language to $\hat{m}_{a \to i} = 1$ and the second to $\hat{m}_{a \to i} = 0$. Similarly, we have $m_{i \to a} = 1$ if the
cavity field is non-zero, and $m_{i \to a} = 0$ otherwise. Therefore our algorithm carries the same
information as Survey Propagation.

The interest of SP over leaf removal is that it keeps track of the leaves which are uniquely
determined by their check. For example, if two or more leaves are connected to the same
check $a$ at time $t$, at time $t + 1$ one has $\hat{m}^{t+1}_{a \to i} = 0$ for all $i \in a$, reflecting the fact that
$a$ cannot uniquely determine the value of several leaves. Conversely, if $a$ is connected to a
unique leaf $i$ and if one has $m^t_{j \to a} = 1$ for all $j \in a - i$, then one gets $\hat{m}^{t+1}_{a \to i} = 1$, reflecting the
fact that, the variables $\{x_j\}_{j \in a - i}$ being fixed in the absence of $a$, $x_i$ is determined uniquely.

A little reasoning shows that when the algorithm stops ($t = t_f$), $i$ is frozen iff $M_i^{t_f} = 1$, and
$i$ belongs to the core iff there exist at least two checks $a, b \in i$ such that $\hat{m}_{a \to i} = \hat{m}_{b \to i} = 1$.
In the final state, we say that the directed edge $i \to a$ is frozen if $m_{i \to a} \equiv m^{t_f}_{i \to a} = 1$, and
that $a \to i$ is frozen if $\hat{m}_{a \to i} \equiv \hat{m}^{t_f}_{a \to i} = 1$. In the opposite case, edges are called floppy. This
version of SP is strictly equivalent to the Belief Propagation algorithm used for decoding
Low-Density Parity-Check codes on the binary erasure channel, also called "Peeling decoder"
in that context.
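
A minimal sketch of this message-passing scheme follows. It assumes parallel updates of the two families of 0/1 messages, with $\hat{m}_{a \to i} = \prod_{j \in a - i} m_{j \to a}$ and $m_{i \to a} = 1 - \prod_{b \in i - a}(1 - \hat{m}_{b \to i})$, all initialized to 1; the update schedule, the function name `warning_propagation` and the example formula are assumptions made for this illustration.

```python
def warning_propagation(n, checks):
    """Parallel 0/1 message passing; returns (frozen variables, core variables)."""
    edges = [(i, a) for a, c in enumerate(checks) for i in c]
    var_checks = [[] for _ in range(n)]
    for i, a in edges:
        var_checks[i].append(a)
    m = {e: 1 for e in edges}       # m[(i, a)]: message from variable i to check a
    m_hat = {e: 1 for e in edges}   # m_hat[(i, a)]: message from check a to variable i
    while True:
        # i is constrained in the absence of a iff some other check b constrains it
        new_m = {(i, a): int(any(m_hat[(i, b)] for b in var_checks[i] if b != a))
                 for (i, a) in edges}
        # a constrains i iff all other variables of a are constrained in the absence of a
        new_m_hat = {(i, a): int(all(m[(j, a)] for j in checks[a] if j != i))
                     for (i, a) in edges}
        if new_m == m and new_m_hat == m_hat:
            break
        m, m_hat = new_m, new_m_hat
    frozen = [i for i in range(n) if any(m_hat[(i, a)] for a in var_checks[i])]
    core = [i for i in range(n) if sum(m_hat[(i, a)] for a in var_checks[i]) >= 2]
    return frozen, core

# Hypothetical example: variables 0..3 form a core; variable 6 is a leaf whose
# check (0, 1, 6) involves two core variables, so it is frozen without being in the core.
checks = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (3, 4, 5), (0, 1, 6)]
print(warning_propagation(7, checks))
```

Starting from all messages equal to 1, the messages can only decrease, so the loop terminates. Compared with plain leaf removal, the fixed point also separates frozen leaves (variable 6 here) from floppy ones (variables 4 and 5).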

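SP can be studied by density evolution in order to derive the phase diagram, as in [18].
Let us briefly survey this study for completeness. The statistics of messages at time $t$ is
described by two numbers, the fractions of variable-to-check and check-to-variable messages
that are equal to 0:

$$w^t = \frac{1}{kM} \sum_{(i,a)} \delta_{m^t_{i \to a}, 0}, \qquad \hat{w}^t = \frac{1}{kM} \sum_{(i,a)} \delta_{\hat{m}^t_{a \to i}, 0} \qquad (5)$$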

FIG. 1: An example of a fixed point of SP. Circles represent variable nodes, and squares check nodes.
An arrow means that message $m$ or $\hat{m}$ has value 1, that is, that the directed edge is frozen when
SP stops. Leaf removal propagates null messages from the outer leaves down to the core, while
"leaf reconstruction" propagates non-null messages from the core up the frozen part.



where the sums run over all edges of the Tanner graph. When $N \to \infty$, these densities are
governed by evolution equations:

$$w^{t+1} = \sum_{\ell} \pi_{k\alpha}(\ell)\, (\hat{w}^t)^{\ell} = \exp[-k\alpha(1 - \hat{w}^t)], \qquad \hat{w}^{t+1} = 1 - (1 - w^t)^{k-1} \qquad (6)$$

which are initialized with $w^0 = \hat{w}^0 = 0$. These equations are exact if the Tanner graph
is a tree. In our case the graph is locally tree-like (it is a tree up to finite distance when
seen from a generic point), and one could set up a rigorous proof of (6) using the methods
developed in [19].
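
The density evolution can be iterated numerically. The sketch below assumes the recursion takes the form $w^{t+1} = \exp[-k\alpha(1 - \hat{w}^t)]$, $\hat{w}^{t+1} = 1 - (1 - w^t)^{k-1}$ with $w^0 = \hat{w}^0 = 0$, where $w$ and $\hat{w}$ are read as the fractions of null messages; the function name and the stopping rule are choices made for this example.

```python
import math

def density_evolution(k, alpha, tol=1e-12, max_iter=100000):
    """Iterate (w, w_hat) from (0, 0) until both change by less than tol."""
    w, w_hat = 0.0, 0.0
    for _ in range(max_iter):
        new_w = math.exp(-k * alpha * (1.0 - w_hat))
        new_w_hat = 1.0 - (1.0 - w) ** (k - 1)
        if abs(new_w - w) < tol and abs(new_w_hat - w_hat) < tol:
            break
        w, w_hat = new_w, new_w_hat
    return w, w_hat

for alpha in (0.70, 0.85):
    w, w_hat = density_evolution(3, alpha)
    lam = 3 * alpha * (1.0 - w_hat)
    core_fraction = 1.0 - (1.0 + lam) * math.exp(-lam)
    print(alpha, w, core_fraction)
```

For $k = 3$ the iteration flows to $w = 1$ (all messages null, empty core) at $\alpha = 0.70$, while at $\alpha = 0.85$ it stops at a fixed point with $w < 1$ and an extensive core.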

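The fixed point of these equations is given by the cavity equation:

$$w = \exp[-k\alpha (1 - w)^{k-1}] \qquad (7)$$

Setting $\lambda = k\alpha(1 - w)^{k-1}$, Eq. (7) can be rewritten as:

$$\lambda = k\alpha (1 - e^{-\lambda})^{k-1} \qquad (8)$$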



When $\alpha < \alpha_d$, the unique fixed point is $\lambda = 0$ (i.e. $w = 1$). This means that the core is
empty. For $\alpha > \alpha_d$ however, there remains an extensive core of size

$$N_c = N \sum_{\ell \ge 2} \pi_\lambda(\ell) = N[1 - (1 + \lambda)e^{-\lambda}] \qquad (9)$$

while the number of frozen variables is

$$N_f = N \sum_{\ell \ge 1} \pi_\lambda(\ell) = N[1 - e^{-\lambda}] \qquad (10)$$

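The number of core checks is:

$$M_c = M(1 - w)^k = \alpha N [1 - e^{-\lambda}]^k . \qquad (11)$$

The left-degree distribution (with respect to core checks) inside the core is given by a
truncated Poissonian:

$$P_c(\ell) = \frac{1}{e^{\lambda} - 1 - \lambda}\, \frac{\lambda^{\ell}}{\ell!}\, \mathbb{I}(\ell \ge 2), \qquad (12)$$

where $\mathbb{I}$ is the indicator function.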

One can show that the leaf removal algorithm conserves the uniformity of the ensemble.
Therefore, the core formula is a random XORSAT formula with right-degree $k$ and left-degree
distribution $P_c(\ell)$ given by (12). The number of solutions to such a formula is known
to concentrate to its mean value when the size goes to infinity [17, 18]. In the case of the
core formula, this number is simply $2^{N_c - M_c}$ if $N_c > M_c$, and 0 otherwise. Recalling that
the complete formula has solutions if and only if the core formula does, we find that the
SAT-UNSAT threshold $\alpha_c$ is given by the equation:

$$1 - (1 + \lambda)e^{-\lambda} = \alpha[1 - e^{-\lambda}]^k \qquad (13)$$

The number of clusters is characterized by the complexity or configurational entropy, that is
the logarithm of the number of core solutions:

$$\Sigma(\alpha) = \frac{1}{N} \log(\#\,\text{clusters}) = \frac{N_c - M_c}{N} = 1 - (1 + \lambda)e^{-\lambda} - \alpha[1 - e^{-\lambda}]^k \qquad (14)$$
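
The two thresholds can be evaluated numerically from these expressions. The sketch below assumes that $\lambda$ solves $\lambda = k\alpha(1 - e^{-\lambda})^{k-1}$, that $\alpha_d$ is the smallest density for which a solution $\lambda > 0$ exists, and that $\alpha_c$ is the density at which $1 - (1+\lambda)e^{-\lambda} - \alpha[1 - e^{-\lambda}]^k$ vanishes on the branch $\lambda \ge \lambda_d$. The parametrization by $\lambda$, the search intervals and the function names are choices made for this example.

```python
import math

def alpha_of_lambda(lam, k):
    """Density alpha for which lam solves lam = k*alpha*(1 - exp(-lam))**(k-1)."""
    return lam / (k * (1.0 - math.exp(-lam)) ** (k - 1))

def complexity(lam, k):
    """1 - (1+lam)e^{-lam} - alpha(1 - e^{-lam})^k at alpha = alpha_of_lambda(lam, k)."""
    e = math.exp(-lam)
    return 1.0 - (1.0 + lam) * e - alpha_of_lambda(lam, k) * (1.0 - e) ** k

def dynamical_threshold(k, lo=1e-3, hi=50.0, iters=200):
    """Minimize alpha_of_lambda over lam by ternary search; returns (alpha_d, lam_d)."""
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if alpha_of_lambda(a, k) < alpha_of_lambda(b, k):
            hi = b
        else:
            lo = a
    lam = 0.5 * (lo + hi)
    return alpha_of_lambda(lam, k), lam

def sat_threshold(k, hi=50.0, iters=200):
    """Bisection on lam >= lam_d for the zero of the complexity; returns alpha_c."""
    _, lo = dynamical_threshold(k)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if complexity(mid, k) > 0:
            lo = mid
        else:
            hi = mid
    return alpha_of_lambda(0.5 * (lo + hi), k)

print(dynamical_threshold(3)[0], sat_threshold(3))
```

For $k = 3$ this gives $\alpha_d \approx 0.818$ and $\alpha_c \approx 0.918$, the values commonly quoted for 3-XORSAT.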

We recall that the group structure of the solution set implies that all clusters have the same
internal structure. Their common internal entropy is therefore given by:

$$s_{\text{inter}} = 1 - \alpha - \Sigma(\alpha) \qquad (15)$$

where we have used the fact that the total entropy is $1 - \alpha$.

Let us comment on the relationship between our definition of clusters and the more traditional
one. Usually, clusters are defined as the "connected" components of the solution
set, where connectedness is to be understood in the following way: two solutions are connected
if one can go from one to the other by a sequence of solutions separated by a finite
Hamming distance (when $N \to \infty$). To make contact with our own definition of clusters,
one needs to prove two things. First, that two solutions built from the same core solution
are connected. Second, that two core solutions are necessarily separated by an extensive
Hamming distance ($> cN$, with $c$ constant), which implies that two solutions built from two
distinct core solutions are not connected. Both proofs can be found in [18]. This reconciles
our definition (which holds for any single instance of XORSAT) with the usual one (which
only makes sense for infinite-length ensembles).

IV. DISTANCE LANDSCAPE: THERMODYNAMICAL APPROACH

As we have already observed, studying pairs of solutions is equivalent to studying solutions
to the ferromagnetic problem. Indeed, if $S$ denotes the affine subspace of solutions to
$A\mathbf{x} = \mathbf{y}$, and $S_0$ the vector subspace of solutions to $A\mathbf{x} = \mathbf{0}$, we have:

$$S \times S = \{(\mathbf{x}', \mathbf{x}' + \mathbf{x}),\ (\mathbf{x}', \mathbf{x}) \in S \times S_0\} \qquad (16)$$

In particular, distances in $S$ are reflected by weights in $S_0$. Therefore, in order to study the
range of attainable distances between solutions, one just needs to study the range of possible
weights in $S_0$. To that end we set up a thermodynamical framework in which the weight plays
the role of an energy:

$$E(\mathbf{x}) = |\mathbf{x}| = \sum_i \delta_{x_i, 1}. \qquad (17)$$

The Boltzmann measure at inverse temperature $\beta$ is thus defined by:

$$P(\mathbf{x}, \beta) = \frac{1}{Z(\beta)} \prod_a \delta\Big(\sum_{i \in a} x_i, 0\Big)\, 2^{-\beta |\mathbf{x}|} \qquad (18)$$

where the normalization constant $Z(\beta)$ is the partition function. The Dirac delta-function,
here defined on the two-element field $\mathbb{F}_2$, enforces that only configurations of $S_0$ are considered.
Remarkably, this measure is formally similar to the one used to infer the most probable
codeword under maximum-likelihood decoding in Low-Density Parity-Check (LDPC) codes
on the Binary Symmetric Channel [20]. In fact, as we shall see soon, some of the methods
used to solve both problems share common aspects.

A very useful scheme for estimating marginal probabilities in models defined on sparse
graphs is the cavity method [14], which we have already mentioned in the previous section.
Let $p^x_{i \to a}$ be the probability that $x_i = x$ under the measure defined by (18), where the link
$(i, a)$ has been removed. The replica symmetric (RS) cavity method consists in computing
the cavity marginals $p^x_{i \to a}$ (viewed as variable-to-check messages) using a closed set of
equations where check-to-variable messages are also introduced as intermediate quantities.
These second-kind messages are denoted by $q^x_{a \to i}$ and are proportional to the probability
that $x_i = x$ when $i$ is connected to $a$ only. Messages are updated until convergence with the
following rules:

$$p^x_{i \to a} = \frac{1}{Z_{i \to a}}\, 2^{-\beta \delta_{x,1}} \prod_{b \in i - a} q^x_{b \to i} \qquad (19)$$

$$q^x_{a \to i} = \sum_{\{x_j\}_{j \in a - i}}\ \prod_{j \in a - i} p^{x_j}_{j \to a}\ \delta\Big(x + \sum_{j \in a - i} x_j, 0\Big) \qquad (20)$$

where $Z_{i \to a}$ is a normalization constant. When convergence is reached, marginal probabilities
are obtained as:

$$p^x_i \equiv P(x_i = x) = \frac{1}{Z_{i + a \in i}}\, 2^{-\beta \delta_{x,1}} \prod_{a \in i} q^x_{a \to i} \qquad (21)$$

where $Z_{i + a \in i}$ is also a normalization constant. Continuing the analogy with codes, it is
interesting to note that these cavity equations are identical [21] to the Belief Propagation
(BP) equations [22] used to decode messages with LDPC codes on the Binary Symmetric
Channel.

It turns out that cavity equations (19), (20) do not admit a unique solution, as one would
expect if the system were replica symmetric. Instead, let us show that they admit exactly
one solution for each cluster. In a given cluster denoted by $c$, let us denote by $c_i$ the value
of a frozen variable $i$. There exists a solution to (19), (20), where, for every frozen variable
$i$:

$$p^x_{i \to a} = \delta_{x, c_i} \ \text{ if } i \to a \text{ frozen}, \qquad q^x_{a \to i} = \delta_{x, c_i} \ \text{ if } a \to i \text{ frozen}. \qquad (22)$$

In order to show that this is a solution, let us use the SP messages, which provide
information on how the fixing of the core solution forces the values of frozen variables. For
example $m_{i \to a} = 1$ indicates that $x_i$ is entirely determined by the core solution, supposing
that the edge $(i, a)$ has been removed. Consider the SP fixed point relations

$$\hat{m}_{a \to i} = \prod_{j \in a - i} m_{j \to a}, \qquad m_{i \to a} = 1 - \prod_{b \in i - a} (1 - \hat{m}_{b \to i}). \qquad (23)$$

They are in fact contained in the cavity equations (19), (20). Indeed, the iteration of
cavity equations allows one to identify the frozen edges, irrespective of the cluster the system
falls into.

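But the cavity equations also contain 'fluctuating' messages, where $p^x_{i \to a}$ and $q^x_{a \to i}$ are in $]0, 1[$,
which are de facto restricted to the floppy part. We parametrize them by the cavity fields
and biases:

$$\beta h_{i \to a} = \log \frac{p^0_{i \to a}}{p^1_{i \to a}}, \qquad \beta u_{a \to i} = \log \frac{q^0_{a \to i}}{q^1_{a \to i}} \qquad (24)$$

which satisfy the equations:

$$h^c_{i \to a} = \sum_{b \in i - a} u^c_{b \to i} + 1 \quad \text{with } i \to a \text{ floppy,} \qquad (25)$$

$$\beta u^c_{a \to i} = 2\, \mathrm{atanh}\Big[ \prod_{j \in a - i,\ j \to a\ \text{floppy}} \tanh(\beta h^c_{j \to a}/2) \prod_{j \in a - i,\ j \to a\ \text{frozen}} (-1)^{c_j} \Big] \quad \text{with } a \to i \text{ floppy.} \qquad (26)$$

Note that cavity messages $h^c_{i \to a}$ and $u^c_{a \to i}$ now depend explicitly on the considered cluster,
and are uniquely determined by it.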
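
As a short check of the field equation, the following worked step assumes the variable update $p^x_{i \to a} \propto 2^{-\beta \delta_{x,1}} \prod_{b \in i - a} q^x_{b \to i}$, base-2 logarithms, and the definitions $\beta h_{i \to a} = \log(p^0_{i \to a}/p^1_{i \to a})$, $\beta u_{a \to i} = \log(q^0_{a \to i}/q^1_{a \to i})$. Taking the ratio of the two components removes the normalization constant:

```latex
\frac{p^0_{i\to a}}{p^1_{i\to a}}
  = \frac{2^{-\beta\cdot 0}\prod_{b\in i-a} q^0_{b\to i}}
         {2^{-\beta\cdot 1}\prod_{b\in i-a} q^1_{b\to i}}
  = 2^{\beta}\prod_{b\in i-a}\frac{q^0_{b\to i}}{q^1_{b\to i}}
\;\Longrightarrow\;
\beta h_{i\to a} = \beta + \sum_{b\in i-a}\beta\, u_{b\to i}
\;\Longrightarrow\;
h_{i\to a} = 1 + \sum_{b\in i-a} u_{b\to i}.
```

The constant $+1$ is thus the contribution of the weight term $2^{-\beta \delta_{x,1}}$, which acts as an external field of unit strength on every variable.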

The multiplicity of solutions to the RS cavity equations is a clear sign that the replica symmetry
is broken. The main lesson from this discussion is that solutions can fluctuate according
to two hierarchical levels of statistics: the first level deals with fluctuations inside a single
cluster, i.e. fluctuations on the floppy part, while the second level deals with the choice of
the cluster. The reduced cavity equations (25), (26) correctly describe the first level$^1$, when
the system is forced to live in cluster $c$. This leads to defining a new probability measure
and partition function, restricted to $c$:

$$Z_c(\beta) = \sum_{\mathbf{x} \in c} 2^{-\beta |\mathbf{x}|} \qquad (27)$$

By construction, this system is characterized by the fixing of the frozen edges (22) and by
the reduced cavity equations (25), (26). The second level of statistics, i.e. the statistics
over the clusters, is appropriately handled by a 1RSB calculation, and will be the object
of section VI. We first focus on the properties of single clusters under the measure defined
by (27).

$^1$ Although the RS Ansatz is unable to describe the whole system, it can reasonably be assumed to be valid
on a single cluster.

The cavity method comes with a technique to estimate the log of the partition function,
also called potential in our case:

$$\phi(\beta) = -\frac{1}{N} \log Z(\beta) \qquad (28)$$

(Note that this quantity differs from the usual free energy by a factor $\beta$.) It can be computed
within the RS Ansatz by the Bethe formula [21]:

$$N\phi(\beta) = \sum_i \Delta\phi_{i + a \in i} - (k - 1) \sum_a \Delta\phi_a \qquad (29)$$

where

$$\Delta\phi_{i + a \in i} = -\log Z_{i + a \in i} = -\log \sum_x \prod_{a \in i} q^x_{a \to i}\, 2^{-\beta \delta_{x,1}}, \qquad \Delta\phi_a = -\log \sum_{\{x_i\}_{i \in a}} \prod_{i \in a} p^{x_i}_{i \to a}\ \delta\Big(\sum_{i \in a} x_i, 0\Big) \qquad (30)$$

This formula has a rather simple interpretation: $\Delta\phi_{i + a \in i}$ is the contribution of $i$ and its
adjacent checks to the potential. When these contributions are summed, each check is
counted $k$ times, whence the need to subtract $k - 1$ times the contribution of each check
$\Delta\phi_a$. Also note that this expression is variational: it is stationary in the messages $\{p_{i \to a}\}$
as soon as the cavity equations (19), (20) are satisfied.

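The RS Ansatz is valid in a single cluster. The single-cluster potential $\phi_c(\beta) =
-\frac{1}{N} \log Z_c(\beta)$ can therefore be computed by plugging Eqs. (22), (25) and (26) into the
Bethe formula (29), (30), provided one uses the messages corresponding to one given cluster $c$.
When one is restricted to a single cluster $c$, the range of possible weights is $[x_c^{\min}, x_c^{\max}]$. The
minimal and maximal weights can be obtained by sending $\beta \to \pm\infty$. For $\beta \to \infty$, the
second cavity equation (26) simplifies to:

$$u^c_{a \to i} = S\Big[ \prod_{j \in a - i,\ j \to a\ \text{floppy}} h^c_{j \to a} \prod_{j \in a - i,\ j \to a\ \text{frozen}} (-1)^{c_j} \Big] \min_{j \in a - i,\ j \to a\ \text{floppy}} |h^c_{j \to a}| \quad \text{with } a \to i \text{ floppy} \qquad (31)$$

where $S(x) = 1$ if $x > 0$, $-1$ if $x < 0$ and $0$ if $x = 0$.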
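
The zero-temperature rule can be compared numerically with the finite-$\beta$ update. The sketch below assumes the finite-$\beta$ bias has the form $u = (2/\beta)\,\mathrm{atanh}[\prod_j \tanh(\beta h_j/2) \prod_j (-1)^{c_j}]$ and that its $\beta \to \infty$ limit is the sign of the product times the smallest $|h_j|$ among floppy inputs; the function names and the sample values are hypothetical.

```python
import math

def sign(x):
    """S(x): +1 if x > 0, -1 if x < 0, 0 if x == 0."""
    return (x > 0) - (x < 0)

def bias_zero_temperature(h_floppy, c_frozen):
    """beta -> +infinity update: sign of the product times the smallest |h|."""
    s = 1
    for h in h_floppy:
        s *= sign(h)
    for c in c_frozen:
        s *= (-1) ** c
    return s * min(abs(h) for h in h_floppy)

def bias_finite_beta(beta, h_floppy, c_frozen):
    """Finite-beta update u = (2/beta) atanh(prod tanh(beta h/2) * prod (-1)^c)."""
    t = 1.0
    for h in h_floppy:
        t *= math.tanh(beta * h / 2.0)
    for c in c_frozen:
        t *= (-1) ** c
    return 2.0 * math.atanh(t) / beta

# Hypothetical incoming messages: three floppy fields and one frozen neighbour with value 1.
h, c = [1.5, -0.7, 2.0], [1]
print(bias_zero_temperature(h, c), bias_finite_beta(20.0, h, c))
```

The two values agree closely already at moderate $\beta$, because the product of hyperbolic tangents is dominated by the factor with the smallest $|h_j|$. Very large $\beta$ should be avoided in the finite-$\beta$ function, since the product then rounds to $\pm 1$ in floating point and `math.atanh` fails.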

The "ground state energy" , i.e. the minimal weight in c, is obtained as: 

N 

X, 



i floppy i frozen 

The $\beta \to -\infty$ limit yields very similar equations. These equations will be analyzed in the
next section.

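Let us also write down the equations giving the potential, which will be used in sect. VI.

$$N\phi_c(\beta) = \sum_i \Delta\phi^c_{i + a \in i} - (k - 1) \sum_a \Delta\phi^c_a \qquad (33)$$

$$\lim_{\beta \to \infty} \frac{1}{\beta} \Delta\phi^c_{i + a \in i} = \Delta x^c_{i + a \in i}, \qquad \lim_{\beta \to \infty} \frac{1}{\beta} \Delta\phi^c_a = \Delta x^c_a \quad \text{with} \qquad (34)$$

$$\Delta x^c_{i + a \in i} = \frac{1}{2} \Big( \sum_{a \in i} |u^c_{a \to i}| + 1 - \Big| \sum_{a \in i} u^c_{a \to i} + 1 \Big| \Big) \quad \text{if } i \text{ is floppy} \qquad (35)$$

$$\Delta x^c_{i + a \in i} = \sum_{a \in i,\ a \to i\ \text{floppy}} \frac{1}{2} \big( |u^c_{a \to i}| - u^c_{a \to i} \big) \quad \text{if } i \text{ is frozen and } c_i = 0 \qquad (36)$$

$$\Delta x^c_{i + a \in i} = 1 + \sum_{a \in i,\ a \to i\ \text{floppy}} \frac{1}{2} \big( |u^c_{a \to i}| + u^c_{a \to i} \big) \quad \text{if } i \text{ is frozen and } c_i = 1 \qquad (37)$$

$$\Delta x^c_a = \frac{1}{2} \Big( 1 - S\Big[ \prod_{j \in a,\ j \to a\ \text{floppy}} h^c_{j \to a} \prod_{j \in a,\ j \to a\ \text{frozen}} (-1)^{c_j} \Big] \Big) \min_{j \in a,\ j \to a\ \text{floppy}} |h^c_{j \to a}| \qquad (38)$$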

V. DIAMETER 

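With our formalism, computing the cluster diameter boils down to computing the maximal
weight in cluster 0 (the cluster containing $\mathbf{0}$). The relevant partition function for this
task is:

$$Z_0(\beta) = \sum_{\mathbf{x} \in 0}\ \prod_a \delta\Big(\sum_{i \in a} x_i, 0\Big)\, 2^{-\beta \sum_{i=1}^{N} x_i} \qquad (39)$$

When $\beta \to -\infty$, the solution of the cavity equations corresponding to cluster 0 is characterized by:

$$\begin{aligned}
p^x_{i \to a} &= \delta_{x,0} && \text{if } i \to a \text{ frozen,} \\
q^x_{a \to i} &= \delta_{x,0} && \text{if } a \to i \text{ frozen,} \\
h_{i \to a} &= \sum_{b \in i - a} u_{b \to i} + 1 && \text{if } i \to a \text{ floppy,} \\
u_{a \to i} &= -S\Big[ \prod_{j \in a - i,\ j \to a\ \text{floppy}} (-h_{j \to a}) \Big] \min_{j \in a - i,\ j \to a\ \text{floppy}} |h_{j \to a}| && \text{if } a \to i \text{ floppy}
\end{aligned} \qquad (40)$$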

and the maximum weight $d_1$ is given by: 

* = lim BMP) = f + (41) 

j floppy 

These equations are presented for single XORSAT formulae, and can be solved by simple 
iteration of the corresponding message-passing rules. In practice however, in the regime 
where $\alpha$ is near (but smaller than) $\alpha_d$, one does not always reach convergence. This is 
arguably due to the hard nature of XORSAT constraints, as was pointed out in [23]: 
as one nears the dynamical transition, hopping from one solution to the other requires an 
increasing (yet sub-extensive) number of changes, making the sampling of solutions difficult. 
To circumvent this problem, we can work directly in the infinite-length limit by considering 
the probability distribution functions (pdfs) of each kind of message: 

$$P(h) \propto \sum_{(i,a)}\delta(h - h_{i\to a}), \qquad Q(u) \propto \sum_{(i,a)}\delta(u - u_{a\to i}) \tag{42}$$

FIG. 2: Diameter of a cluster of solutions. When one decreases $\alpha$ below $\alpha_d$, all clusters aggregate 
into one big cluster, thus explaining the discontinuity. 

When $N\to\infty$, self-consistency equations for these distributions read: 

P{h) =J2'^kaM I n ^(^-)'^ ( ^ - - M 

I a=l \ a=l / 

Q{u) = - E f ^ T %\l - vf-'-^ [ n dh, P{h,)5 



i=l ^ ^ >' j=l 

and one has 



+ s[\{{-h,)]imn\h 



1 1 



(43) 



xM) = lini I = e-^ / dhP{h) i±|i^ (44) 

Af^oo I\ J Z 

These equations can be solved with a population dynamics algorithm [14]. In Fig. 2, we 
represent the maximal diameter $x_1$ as a function of $\alpha$. 
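
To make the population dynamics idea concrete, here is a minimal Python sketch. It is not the computation used for Fig. 2: it assumes the simplest setting where every edge is floppy, where the number of other checks around a variable is Poisson distributed with mean $k\alpha$, and where the $\beta\to-\infty$ updates are $h = 1 + \sum u$ and $u = -S[\prod(-h_j)]\min|h_j|$. The function name and all parameters are hypothetical.

```python
import numpy as np


def population_dynamics(k, alpha, size=1000, sweeps=20, seed=0):
    """Evolve populations representing the pdfs P(h) and Q(u).

    Assumptions of this sketch (not taken from the paper): all edges are
    floppy and variable degrees are Poisson with mean k * alpha.
    """
    rng = np.random.default_rng(seed)
    h_pop = np.ones(size)   # fields h, initialised as for an isolated variable
    u_pop = np.zeros(size)  # biases u
    for _ in range(sweeps * size):
        # Check update: draw k - 1 fields at random from the population.
        hs = h_pop[rng.integers(size, size=k - 1)]
        u_new = -np.prod(np.sign(-hs)) * np.min(np.abs(hs))
        u_pop[rng.integers(size)] = u_new
        # Variable update: draw a Poisson number of biases and add them up.
        degree = rng.poisson(k * alpha)
        us = u_pop[rng.integers(size, size=degree)]
        h_pop[rng.integers(size)] = 1.0 + us.sum()
    return h_pop, u_pop
```

Each step replaces one randomly chosen member of a population by a value computed from randomly drawn members of the other population, so that after many sweeps the two arrays sample the fixed-point distributions. Turning the sampled fields into a diameter additionally requires Eq. (44), which this sketch does not attempt.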
FIG. 3: Pictorial representation of the clustered space of solutions around $\mathbf{0}$ in the $N$-dimensional 
hypercube. For a cluster $c$, the minimal and maximal distances $x_c$ and $X_c$ are depicted. 

VI. MINIMAL AND MAXIMAL DISTANCES BETWEEN CLUSTERS 

In section IV we have set up the formalism for computing the minimal and the maximal 
weights in a given cluster $c$ using the cavity method. In order to evaluate the minimal and 
maximal weights in all clusters except 0, we resort to a statistical treatment of the cavity 
equations. This scheme is known as the 1RSB cavity method in the replica language. We 
first specialize to the case of minimal weights, the other case being formally equivalent. We 
already know that the number of clusters grows exponentially with $N$. Here we further 
assume that the number of clusters with a given minimal weight $x_c$ is exponential in $N$, and 
we define the complexity $\Sigma_m(x)$ through 

$$\sum_{c\neq 0}\delta(x, x_c) = 2^{N\Sigma_m(x)}. \tag{45}$$

To this quantity we associate the 1RSB potential 

$$2^{-N\psi_m(y)} = \sum_{c\neq 0}2^{-Nyx_c} = \int dx\,2^{N(\Sigma_m(x) - yx)}. \tag{46}$$

When $N$ is large, a saddle-point evaluation of this quantity yields: 

$$\psi_m(y) = \min_x\,[yx - \Sigma_m(x)] = yx^* - \Sigma_m(x^*) \quad\text{with } y = \partial_x\Sigma_m(x^*). \tag{47}$$

$\psi_m(y)$ is thus related to $\Sigma_m(x)$ by a Legendre transformation. In terms of statistical 
mechanics, $y$ is an inverse temperature coupled to the "energy" $x_c$, the complexity plays the 
role of a micro-canonical entropy, and the potential is equivalent to a free energy, up to a 
factor $y$. The minimal weight in all clusters (except 0) is given by the smallest $x$ such that 
$\Sigma_m(x) > 0$. Our goal is now to compute $\psi_m(y)$ and infer $\Sigma_m(x)$ by inverse Legendre 
transformation. 
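
The Legendre pair can be illustrated numerically on a grid. The sketch below uses a made-up concave complexity $\Sigma(x) = 0.1 - (x - 1/2)^2$, chosen only for illustration and unrelated to the curves of the paper. It computes $\psi(y) = \min_x[yx - \Sigma(x)]$, inverts it as $\Sigma(x) = \min_y[yx - \psi(y)]$, and reads off the smallest $x$ with positive complexity. The helper names are hypothetical.

```python
import numpy as np


def legendre(xs, sigma, ys):
    """psi(y) = min_x [y * x - Sigma(x)], evaluated on a grid of y values."""
    return np.min(ys[:, None] * xs[None, :] - sigma[None, :], axis=1)


def inverse_legendre(ys, psi, xs):
    """Sigma(x) = min_y [y * x - psi(y)]; valid when Sigma is concave."""
    return np.min(xs[:, None] * ys[None, :] - psi[None, :], axis=1)


# Toy concave complexity (an assumption of this sketch, not a result).
xs = np.linspace(0.0, 1.0, 1001)
sigma = 0.1 - (xs - 0.5) ** 2
ys = np.linspace(-3.0, 3.0, 1001)

psi = legendre(xs, sigma, ys)
sigma_back = inverse_legendre(ys, psi, xs)

# Smallest x with positive complexity: the toy "minimal distance".
x_min = xs[sigma_back > 0][0]
```

The inversion only recovers the concave hull of $\Sigma$, so a non-concave complexity cannot be reconstructed this way from a single-valued $\psi$.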

We proceed to the statistical analysis of the cavity equations under the Boltzmann measure 
$2^{-Nyx_c}$. This amounts to writing 1RSB cavity equations, where messages are distributions 
of RS messages over all clusters. The distribution of messages on floppy edges is described 
by the two pdfs: 

$$P^{i\to a}(h) = \langle\delta(h, h_{i\to a})\rangle \tag{48}$$
$$Q^{a\to i}(u) = \langle\delta(u, u_{a\to i})\rangle. \tag{49}$$
The average $\langle\cdot\rangle$ is performed with the aforementioned measure on clusters, with the implicit 
assumption that the edge $(i, a)$ has been removed. On frozen edges, messages are trivial, 
but their values depend on the considered cluster. We thus define for frozen edges: 

$$P_0^{i\to a} = \langle\delta(p_{i\to a}(0), 1)\rangle, \qquad P_1^{i\to a} = 1 - P_0^{i\to a} \tag{50}$$
$$Q_0^{a\to i} = \langle\delta(q_{a\to i}(0), 1)\rangle, \qquad Q_1^{a\to i} = 1 - Q_0^{a\to i} \tag{51}$$



In order to write a closed set of equations for these probability distributions, we need 
to know how the Boltzmann weight $2^{-Nyx_c}$ biases the message-passing procedure: when a 
field $h_{i\to a}$ is estimated as a function of its "grand-parents" $\{h_{j\to b}\}$, $j\in b-i$, $b\in i-a$, 
a re-weighting term $2^{-y\Delta x_{i\to a}}$ is associated to it [7, 14], where $\Delta x_{i\to a}$ is the contribution of 
$i$ and its adjacent checks (except $a$) to the total weight. This contribution is obtained as 
$\Delta x_{i+a\in i}$ in Eqs. (35)-(37), but with $a$ removed. 

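The 1RSB cavity equations read: 

$i\to a$ frozen: 

$$\begin{aligned}
P_0^{i\to a} &= \frac{1}{Z_{i\to a}}\prod_{b\in i^{nf}-a}Q_0^{b\to i}\int\prod_{b\in i^f-a}du_{b\to i}\,Q^{b\to i}(u_{b\to i})\;2^{-\frac{y}{2}\sum_{b\in i^f-a}(|u_{b\to i}| - u_{b\to i})}\\
P_1^{i\to a} &= \frac{1}{Z_{i\to a}}\prod_{b\in i^{nf}-a}Q_1^{b\to i}\int\prod_{b\in i^f-a}du_{b\to i}\,Q^{b\to i}(u_{b\to i})\;2^{-y\left(1 + \frac{1}{2}\sum_{b\in i^f-a}(|u_{b\to i}| + u_{b\to i})\right)}
\end{aligned} \tag{52}$$

$i\to a$ floppy: 

$$P^{i\to a}(h) = \frac{1}{Z_{i\to a}}\int\prod_{b\in i-a}du_{b\to i}\,Q^{b\to i}(u_{b\to i})\;2^{-\frac{y}{2}\left(\sum_{b\in i-a}|u_{b\to i}| + 1 - \left|\sum_{b\in i-a}u_{b\to i} + 1\right|\right)}\;\delta\Big(h - 1 - \sum_{b\in i-a}u_{b\to i}\Big) \tag{53}$$

(here and in the previous equations $Z_{i\to a}$ is a normalization constant) 

$a\to i$ frozen: 

$$Q_0^{a\to i} = \frac{1}{2}\Big[1 + \prod_{j\in a-i}\big(2P_0^{j\to a} - 1\big)\Big] \tag{54}$$

$a\to i$ floppy: 

$$Q^{a\to i}(u) = \sum_{\{c_j=0,1\}}\prod_{j\in a^{nf}-i}P_{c_j}^{j\to a}\int\prod_{j\in a^f-i}dh_{j\to a}\,P^{j\to a}(h_{j\to a})\;\delta\Big(u - S\Big[\prod_{j\in a^f-i}h_{j\to a}\prod_{j\in a^{nf}-i}(-1)^{c_j}\Big]\min_{j\in a^f-i}|h_{j\to a}|\Big) \tag{55}$$

The potential $\psi_m(y)$ is obtained by a Bethe-like formula [7]: 

$$N\psi_m(y) = \sum_i\Delta\psi_{i+a\in i} - (k-1)\sum_a\Delta\psi_a \tag{56}$$

FIG. 4: Minimal and maximal distance complexities as a function of the reduced distance $x$, for 
$k = 3$, $N = 10000$ and $M = 8600$. 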



with 



log 



1 + n.ea(2i^o' 



if a G core 



-log E n^r/ n d/l.^aP^-^'^(/^.^a) 
{ci=0,l} jea/ iea"/ 



(57) 



X exp 



mill \hi 



otherwise 



where $Z_{i+a\in i}$ is defined as $Z_{i\to a}$ but in the presence of $a$. 

As in the diameter calculation, the 1RSB cavity equations can be interpreted as message-passing 
update rules, with the difference that messages are now surveys over all clusters. 
The output of that procedure is the minimal distance complexity $\Sigma_m(x)$, obtained as the 
inverse Legendre transform of $\psi_m(y)$. We refer to the corresponding algorithm as "distance 
survey propagation". The same procedure can be implemented in the $\beta\to-\infty$ limit, and 
yields the maximal distance complexity $\Sigma_M(x)$, defined through: 

$$\sum_{c\neq 0}\delta(x, X_c) = 2^{N\Sigma_M(x)} \tag{58}$$

where $X_c$ is the maximal weight in cluster $c$ (see Fig. 3). Note that in the particular case 
where $y = 0$, which corresponds to a uniform measure over the clusters, classical SP is 
recovered for both versions of the algorithm (minimal and maximal distance): in that limit 
we have $Q_0^{a\to i} = P_0^{i\to a} = 1/2$, and the calculation of $\psi_m(0)$ and $\psi_M(0)$ gives back $-\Sigma(\alpha)$, 
the total complexity (14), as expected. 

FIG. 5: Phase diagram of the 3-XORSAT problem in the $(x, \alpha)$ plane. The cluster diameter 
(□), as well as minimal (+) and maximal (×) distances between solutions of distinct clusters, are 
represented. The thick line is the x-satisfiability threshold. 
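
The statement that the surveys on frozen edges sit at 1/2 when $y = 0$ can be checked on a small example. The sketch below assumes that the check-to-variable survey $Q_0$ is the probability that the other frozen variables of the check have even parity, each being 0 independently with probability $P_0$, and that the variable-to-check survey multiplies the incoming $Q$'s with an extra factor $2^{-y}$ for the value 1. Both rules are assumptions of this illustration, and the function names are hypothetical.

```python
from itertools import product
from math import prod


def check_update(p0s):
    """Q_0 = (1 + prod(2 P_0 - 1)) / 2: probability of even parity."""
    return 0.5 * (1 + prod(2 * p - 1 for p in p0s))


def check_update_bruteforce(p0s):
    """Same probability by explicit enumeration of the values c_j."""
    total = 0.0
    for cs in product((0, 1), repeat=len(p0s)):
        if sum(cs) % 2 == 0:
            total += prod(p if c == 0 else 1 - p for p, c in zip(p0s, cs))
    return total


def variable_update(q0s, y):
    """P_0 from incoming Q_0's, with weight 2^{-y} attached to x_i = 1."""
    w0 = prod(q0s)
    w1 = 2.0 ** (-y) * prod(1 - q for q in q0s)
    return w0 / (w0 + w1)
```

With all inputs equal to 1/2 and $y = 0$, both updates return 1/2, so this point is a fixed point of the assumed rules. A positive $y$ biases the variable survey towards 0.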

The practical implementation of distance-SP demands particular care when small distances 
are considered: it turns out that the distance complexities $\Sigma_m$ and $\Sigma_M$ are not 
concave, which entails that the functions $\psi_m(y)$ and $\psi_M(y)$ are multivalued in a certain 
range of $y$. A way to circumvent this problem (already used in [24]) is to keep the weight 
$x = \partial_y\psi_m(y)$ fixed after each iteration, and to deduce $y$ accordingly. Here is how the 
algorithm proceeds for a given reduced weight $x$: 

1. Run classical SP. 

2. Initialize all floppy and frozen messages $\{P_{i\to a}\}$, $\{Q_{a\to i}\}$ to random values. Choose a 
(reasonable) value for $y$. 

3. Until convergence is reached, do: 

• Update all $a\to i$ messages $\{Q_{a\to i}\}$, and then all $i\to a$ messages $\{P_{i\to a}\}$ at inverse 
temperature $y$. 

• Find $y$ such that $x = \partial_y\psi_m(y, \{P_{i\to a}\}, \{Q_{a\to i}\})$ by the secant method, $\{P_{i\to a}\}$ and 
$\{Q_{a\to i}\}$ being fixed. 

4. Compute $\psi_m(y, \{P_{i\to a}\}, \{Q_{a\to i}\})$ as well as its derivative and deduce $\Sigma_m(x) = yx - \psi_m(y)$. 

Note that since the messages are pdfs themselves, the update of each of them in step 3 is 
performed by a population dynamics sub-routine. 
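
The inner step of the procedure above, finding $y$ at fixed messages so that $\partial_y\psi_m = x$, can be sketched with a plain secant iteration. In the sketch below the potential is a made-up closed-form function `psi_toy` standing in for $\psi_m(y, \{P_{i\to a}\}, \{Q_{a\to i}\})$; the real potential would come from the Bethe-like formula. All names are hypothetical.

```python
def secant(f, y0, y1, tol=1e-9, max_iter=50):
    """Find a root of f by the secant method, starting from y0 and y1."""
    f0, f1 = f(y0), f(y1)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        # Secant step: intersect the chord through the last two points with 0.
        y0, y1, f0 = y1, y1 - f1 * (y1 - y0) / (f1 - f0), f1
        f1 = f(y1)
    return y1


def psi_toy(y):
    """Stand-in potential (an assumption of this sketch, not a result)."""
    return 0.5 * y - 0.25 * y * y - 0.1


def dpsi(y, eps=1e-5):
    """Central finite-difference estimate of the derivative of psi_toy."""
    return (psi_toy(y + eps) - psi_toy(y - eps)) / (2 * eps)


x_target = 0.3
y_star = secant(lambda y: dpsi(y) - x_target, 0.0, 1.0)
sigma_at_x = y_star * x_target - psi_toy(y_star)  # Sigma(x) = y x - psi(y)
```

Fixing $x$ and solving for $y$, rather than scanning $y$, is what lets the procedure follow a non-concave branch where $\psi$ is multivalued. In this toy case the potential is single-valued, so the sketch only shows the mechanics of the step.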

Fig. 4 shows the minimal and maximal weight complexities $\Sigma_m(x)$ and $\Sigma_M(x)$ for a 
random 3-XORSAT formula with $N = 10000$ and $M = 8600$. These complexities can 
be regarded as kinds of weight enumerator functions for clusters. Their fluctuations from 
formula to formula can be significant (15%), even for large system sizes ($N = 10000$). 

An average version (density evolution) of distance-SP can also be implemented for random 
$k$-XORSAT, in the same spirit as Eq. (43). Such a computation involves distributions (on 
edges) of distributions (on clusters), and can be solved by population dynamics, where 
each element of the population is itself a population. The zeros of $\Sigma_m(x)$ and $\Sigma_M(x)$ thus 
obtained yield the minimal and maximal inter-cluster distances $x_2(\alpha)$ and $x_3(\alpha)$, respectively, 
as shown in Fig. 5. Together with the cluster diameter $x_1(\alpha)$ computed in section V, these 
values are used to construct the x-satisfiability threshold. 

Our algorithm can in principle be run on any system of Boolean linear equations, and is 
expected to give reasonable results provided that the loops of the underlying Tanner graph 
are large. The case of LDPC codes is of particular interest because it allows several simpli- 
fications and has been extensively studied from both the combinatorial [25] and statistical 
physics [24, 26] point of view. LDPC codes are homogeneous Boolean linear systems where 
parity checks and variables may have arbitrary degree distributions, with the restriction 
that variables should always have degrees no less than 2. This implies that the leaf removal 
algorithm is inefficient on such linear systems: all variables belong to the core, and are 
frozen. In particular, each cluster is made of one unique solution: the cluster diameter is 0, 
and the minimal and maximal inter-cluster distances coincide. Their common complexity 
$\Sigma_m(x) = \Sigma_M(x)$ is often called 'weight enumerator exponent' and is an important property 
of ensembles of codes. Translated into our formalism, this means that all messages are 
frozen, and the distance-SP algorithm simplifies dramatically: 

$$P_0^{i\to a} = \frac{1}{Z_{i\to a}}\prod_{b\in i-a}Q_0^{b\to i}, \qquad P_1^{i\to a} = \frac{2^{-y}}{Z_{i\to a}}\prod_{b\in i-a}Q_1^{b\to i} \tag{59}$$

Not surprisingly, the density evolution analysis of this simplified algorithm yields the 
same equations as those obtained with the replica method in [24, 26]. 
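
The remark that leaf removal does nothing when every variable has degree at least 2 can be illustrated with a small sketch. It assumes the usual form of the procedure: repeatedly pick a variable appearing in exactly one check and delete that check. The function name and the representation of the system as a list of index tuples are hypothetical.

```python
def leaf_removal(checks, n_vars):
    """Return the indices of the checks that survive leaf removal (the core).

    checks: list of tuples of variable indices, one tuple per parity check.
    """
    alive = set(range(len(checks)))
    var_to_checks = [set() for _ in range(n_vars)]
    for a, variables in enumerate(checks):
        for v in variables:
            var_to_checks[v].add(a)
    # A leaf is a variable that appears in exactly one remaining check.
    leaves = [v for v in range(n_vars) if len(var_to_checks[v]) == 1]
    while leaves:
        v = leaves.pop()
        if len(var_to_checks[v]) != 1:
            continue  # stale entry: the degree changed since v was queued
        (a,) = var_to_checks[v]
        alive.discard(a)
        for w in checks[a]:
            var_to_checks[w].discard(a)
            if len(var_to_checks[w]) == 1:
                leaves.append(w)
    return sorted(alive)
```

On a chain-like system such as `[(0, 1, 2), (1, 2, 3), (2, 3, 4)]` the procedure removes every check, whereas on `[(0, 1, 2), (1, 2, 3), (0, 2, 3), (0, 1, 3)]`, where each variable appears three times, no leaf exists and all checks remain in the core.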



VII. CONCLUSION AND DISCUSSION 

We have applied the cavity method to estimate extremal distances between solutions 
of random linear systems with large girth in the clustered phase. Our results are used to 
compute the x-satisfiability threshold of the random $k$-XORSAT problem. The notion of 
x-satisfiability, which tells us whether one can find a pair of solutions separated by a Hamming 
distance $x$, was introduced in the context of another constraint satisfaction problem, $k$-SAT, 
where it was used to give rigorous evidence in favor of the clustering phenomenon [10]. 
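
For a very small system the notion can be made concrete by exhaustive enumeration. The sketch below assumes a homogeneous system (all parities equal to 0), for which the distances between pairs of solutions are exactly the Hamming weights of the solutions, since the difference of two solutions is again a solution. The function names are hypothetical, and the approach only scales to a handful of variables.

```python
from itertools import product


def solution_weights(checks, n_vars):
    """Set of Hamming weights of all solutions of a homogeneous XOR system.

    checks: list of tuples of variable indices whose sum must be even.
    """
    weights = set()
    for x in product((0, 1), repeat=n_vars):
        if all(sum(x[i] for i in check) % 2 == 0 for check in checks):
            weights.add(sum(x))
    return weights


def is_x_satisfiable(checks, n_vars, d):
    """True if two solutions at Hamming distance d exist (brute force)."""
    return d in solution_weights(checks, n_vars)
```

For the two checks `(0, 1, 2)` and `(1, 2, 3)` on four variables, the solutions are 0000, 1101, 1011 and 0110, so pairs of solutions exist at distances 0, 2 and 3 but not at distance 1 or 4.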

Although $k$-XORSAT is a rather simple problem, it displays a phase diagram very similar 
to that of harder problems such as $k$-SAT or $q$-colorability. In particular, its clustered phase is well 
defined and understood. That said, finding extremal distances in the solution space of linear 
Boolean equations is a hard task in general: for instance, the decision problem associated 
with finding the minimal weight of LDPC codes is NP-complete [27]. 

We were able to compute three quantities: the cluster diameter, as well as the minimal and 
maximal inter-cluster distances. We believe our method to give a good approximation for 
systems with large girth, and to be exact in the thermodynamic limit for random XORSAT. 
In the line of Survey Propagation, we devised a series of algorithms for these tasks, which 
explicitly exploit the clustered structure of the solution space. More precisely, the space 
of solutions is characterized by two hierarchical levels of fluctuations: inside and between 
clusters. In $k$-XORSAT, these two kinds of fluctuations are carried by two disjoint sets of 
variables, and our algorithms explicitly distinguish between these two types of variables. 
In the special case of LDPC codes, the point-like nature of clusters greatly simplifies the 
equations, and previous expressions of the weight enumerator exponent obtained by the 
replica method are recovered. 

The method presented here can be generalized in a number of ways. In particular, it could be 
used at finite temperature to yield the full weight enumerator function. More interestingly, it 
could be adapted to deal with other CSN, such as $k$-SAT, for which only bounds are known; 
unfortunately, numerical computations are in that case much heavier, albeit formally similar. 
Let us mention that a similar approach was followed in [28] in the case of $q$-colorability, with 
the difference that distances were estimated from a reference configuration (which is not a 
solution) instead of considering distances between solutions. 

Our work studies the geometrical properties of the solution space by taking explicitly into 
account fluctuations inside clusters, captured by the 'evanescent fields'. This very general 
approach, already explored in [28], allows one to gain a better understanding of the fine structure 
of the clustered phase, and seems to us a promising direction for future work. Also, with 
similar tools, decimation schemes such as the one introduced in [7] could be used to select 
solutions or clusters with particular properties. 

We would like to thank Andrea Montanari for sharing the numerical trick used in the 
replica evaluation of the weight enumerator function of LDPC codes [24]. This work has 
been supported in part by the EU through the network MTR 2002-00319 'STIPCO' and the 
FP6 IST consortium 'EVERGROW'. 